library(tidyverse)
library(janitor)Cleaning The Data
It’s very rare that a dataset comes in the exact format that you want it. There might be unnecessary/distracting columns (as you’ll soon see with the Texas Lege data), empty cells, confusing column names, etc. This is why it’s always good practice to “clean” the data and get it in proper shape before you start analyzing it. How do you clean it? With code! You can use simple lines of code to rename columns, remove or select columns, merge data from separate tables together, and more.
Before we get into that, let’s open the README file to learn about the variables included in each dataset.
Read about each CSV
Click on the README.md file on the right side to open the document. It should appear in the top left window. If you want to collapse the console/terminal for a better view, click on the small box in the top right corner of the console window.
Now you can scroll through the README and learn more about each dataset, including the column names and a description of what the data in that column is. The README comes directly from the developers at LegiScan; this is a nice explanatory document that many databases will provide with raw data, but it isn’t always included.

As I said in the last section, we’re only going to be working with the “bills,” ”people,” and ”sponsors” datasets, so definitely take a look at those. If you want to close the document, click on the small”x” next to the file name (right above the toolbar). But you can also keep it open as we work in our cleaning file.
Setup
Before we start coding, we need to install and load the packages that are going to help facilitate the data analysis. You don’t need to worry about what each package does, but this is an important setup step that allows us to use specific functions later on. I always include tidyverse and janitor at the top of every R file.
If you want to learn more about what each of those packages does, you can read about them here and here. But again, not super necessary at the moment.
Installing the packages
Expand your console at the bottom of the page, type the following command (in the console, not the terminal), and hit enter. Make sure the name of the package is in quotes, or you’ll get an error!
install.packages("tidyverse")
It’s going to immediately start spewing out a bunch of scary red text – don’t worry, that actually means it’s working. It’s installing all of the tidyverse packages (and there are quite a few) to the project so you can run them later on. It may take a minute or two to finish. You’ll know it’s done if you see the message “The downloaded source packages are in” and this symbol reappears: >
Next, do the same thing for janitor.
install.packages("janitor")
This one shouldn’t take as long.
Load them into the cleaning file
Great! Click your cleaning.qmd file to open it up. At the top, write a little description to remind your future self what this file was for, something like “In this file I cleaned the bills, people, and sponsors data from LegiScan and merged the three together.”
Next, insert a code chunk (see tip below). Throughout this lesson, I recommend writing a sentence above every code chunk that explains what the code does; it’s a good way of tracking your work for your future self. For example, above the chunk you just added write “I’m loading the tidyverse and janitor packages to this file.”
Add a “code chunk” to your file by hitting Command+Option+i on the keyboard. For PC users, it’s ctrl + alt + i. You’ll run all of your code inside these little blocks, so get used to this shortcut!
Copy and paste (or type) the code below and “run” it by hitting the green arrow in the top right of your code chunk. As you can see below, a message pops up when you run it. You don’t need to pay much attention to that; as long as it doesn’t say “Error” you’re fine to move on.
To run code without clicking on the green arrow, hit Command + Enter on the keyboard. Another good one to get used to.
And that’s all you need to worry about for setup!
posit.cloud does not auto save!! So make sure to hit Command + S/ctrl + S frequently to save your work as you go.
Load the raw data
Next, you need to tell R that you want it to read the csv files from your data-raw folder so they appear below. The function to do this is aptly named read_csv().
Create a new code chunk and add a sentence above it explaining that you’re importing the three csv files.
Run the code below to import the “bills” data into this file. No data should appear yet, but you’ll see a message from R pop up. By running this code, you’ve created an object called “bills” and have told R that you want to read the csv and save that data inside the new object.
bills <- read_csv("data-raw/bills.csv")By putting “data-raw/bills.csv” inside the function, you’re basically telling R: inside the data-raw folder, read the file bills.csv
In the same code chunk, hit Enter twice so you’re on a new line. Type bills and run the code again. You should see a table with 14 columns and thousands of rows of data! Click through it.
bills Repeat these two steps for the “sponsors” data and “people” data.
Use read_csv to import the data and save it in an object. Run that code.
On a new line, type the name of the object and run it so the table appears.
Remember to leave descriptions of your work and save as you go!
sponsors <- read_csv("data-raw/sponsors.csv")sponsors lawmakers <- read_csv("data-raw/people.csv")lawmakersClean the tables
You probably noticed while clicking through that these tables aren’t as clean as they could be. Some of the column names don’t make sense, there are a lot of “NA” cells, and some of the information is just unnecessary for what we want to look at in this project. Let’s go through each table one at a time and clean them up.
1. Bills
If you go back and look at the bills table, you can see it includes a lot of different columns, some of which are just links. We’re going to use the select() function to remove those columns. We can also take out the session_id column because we know all of this data is from the 89th legislative session, so that value is going to be the same for every entry.
Create a new code block, name it, and then copy and run the code below. By putting a minus sign in front of the column name, you’re telling R: don’t include this column in the new table.
bills_clean <- bills |>
select(-session_id, -url, -state_link) bills_cleanNotice how I saved this in a new object called “bills_clean.” This means the untouched data still exists in the object “bills” if you ever wanted to run that and view it again. It’s good practice to create new objects as you change the data and rename it to something that makes sense.
Now you can click through the table to see that the url, state link, and session id data is gone.
2. Lawmakers (people)
Looking at the people data, which we’ve saved as “lawmakers,” you can see there are a lot of specific ids for other open data portals, like Follow the Money, Open Secrets, Ballotpedia, and KnowWho. While these are cool to have, we don’t need them for the analysis we’re running today. If you were writing a story about legislative campaign funds in Texas, for example, you might chose to keep the Follow the Money column.
To keep things simple (and to make our table a little smaller) let’s remove those columns. We’re still going to use the select() function by in the opposite way, this time telling R which columns you do want to include.
lawmakers_clean <- lawmakers |>
select(people_id, name, middle_name, suffix, party, role, district) lawmakers_cleanYou could do this the same way we did for bills, which was use select(-) and tell R the names of the columns you want to remove. It doesn’t really matter, I just wanted to demonstrate both options.
3. Sponsors
Sponsors is a much smaller data table, with only three columns. As noted in the README, “position” refers to where a lawmaker appears on the sponsor list for a specific bill. For example, if a lawmaker is the primary sponsor for a bill, their position number would read “1” … if they were the second person to sponsor that bill, it would read “2” … etc.
All three of these columns will be useful to us, so we’re not going to remove anything.
However, this is a good chance to try the rename() function, which allows you the change the name of a column. Personally, I want to remind myself that the position number is associated with sponsor order, so I want to rename the column to “spons_position”
sponsors_clean <- sponsors |>
rename(spons_position = position)sponsors_cleanNow you can see the new column name when you run the table.
If you want to choose a name that makes more sense to you, use this convention in the code: rename([new column name] = [original column name])
Merge the tables together
Now that we’ve cleaned up each table, we’re going to combine everything together into one big table to use for our analysis. We’ll do this using the merge() function.
Since we’re combining three different tables, we’ll do this in two steps. First, copy and run the code below (in a new code chunk) to merge the lawmaker and sponsor tables together. I’ve created a new object called “law-spons”
law_spons <- merge(lawmakers_clean, sponsors_clean) law_sponsGreat! That’s most of the data we need, but we’re still missing some information from the bills data. Run the code below in a new chunk to add it.
I’ve also renamed the middle_name column to say mid_init because that’s more accurate, and I’m picky.
final_table <- law_spons |>
left_join(bills_clean) |>
rename(mid_init = middle_name) Joining with `by = join_by(bill_id)`
final_tableAmazing! You should have one big data table with 19 different columns:
people_id = the unique number associated with each lawmaker
name = lawmaker’s first and last name
mid_init = lawmaker’s middle initial
suffix = lawmaker’s suffix, if applicable
party = political party of each lawmaker
role = whether the lawmaker is a senator or a representative
district = the congressional district associated with each lawmaker
bill_id = the unique number associated with a specific bill filed this session
bill_number = standard bill naming in the legislature
spons_position = for a specific bill, this number indicates where a lawmaker is in the line of sponsors
status_desc = each bill’s progress in the session (as of date of download); No progress, Introduced, Engrossed, or Passed
status = numerical representation of a bill’s status
status_date = date a specific bill achieved the corresponding status
title = shorter description of bill as written in LegiScan
description = longer description of bill as written in LegiScan
committee = which Senate/House committee the bill was referred to
committee_id = unique code associated with each committee
last_action = most recent update to bill as of the date you downloaded the data
last_action_date = calendar date of that update
Phew! This is a lot of data to work with. We may not end up using everything, but it’s still important to have a full picture of the data available. The table has over 10,000 rows of data, each corresponding with a bill/joint resolution. Now you can see the appeal of sorting through everything in R, rather than manually or in a spreadsheet.
There aren’t actually that many unique bills in session because some of because some of those are “companion bills,” meaning two identical bills, one filed in the House and one filed in the Senate.
Save the clean table
Before we move on to the analysis, you need to save the final table to your data-processed folder. To do this, you’re going to tell R to create an Rds file, which saves final_table object you created earlier. Use the write_rds function, and inside the parenthesis, include the location (data-processed folder) and a name for your new rds file. I chose to name mine “bills-89” but feel free to name yours something else as long as it makes sense. The code looks like this:
final_table |>
write_rds("data-processed/bills-89.rds")Nothing appears when you run it, but if you click into your data-processed folder, you should see the Rds file.
Have you saved your changes to this file? Hit Command/ctrl + S!
Recap
This section had a lot of information! Here are the key takeaways:
README files are helpful resources to learn what’s included in a dataset.
Setup is an important first step for running code. You learned how to install packages in your console and load those into your R file using the library() function.
You need to tell R which file you want it to read and spit out. Use the read_csv function and include a direct path to the location of your file.
Clean up data using a couple helpful functions:
Combine columns from different tables using merge()
Save objects in an Rds file using write_rds, and make sure to direct these to your data-processed folder.
In the next section, we’ll brain storm some questions about the Texas Legislative Session and try to answer them with some simple data computations.